Protein function prediction and classification using uncertainty

Authors

  • James R. Bradford
  • Chris J. Needham
  • Andrew J. Bulpitt
  • David R. Westhead
Abstract

The overall aim of this project is to investigate the use of Bayesian networks (Needham et al., 2006b) for integrating information, expressing relationships and making inferences or predictions on biological problems, motivated by data generation in genomics and proteomics. We have already successfully applied Bayesian networks to two problems in which we have previous experience. In the first instance, surface patch analysis was combined with a Bayesian network to predict protein-protein binding sites (Bradford et al., 2006) with a success rate of 82% on a benchmark dataset of 180 proteins, improving by 6% on previous work and well above the 36% that would be achieved by a random method. Interestingly, a comparable success rate was achieved even when evolutionary information was missing, suggesting that, in most cases, only chemical and physical surface properties are required for accurate prediction. Next, we used Bayesian networks to predict the functional consequences of missense mutations on proteins (Needham et al., 2006a). Exploiting the ability of the Bayesian network to handle missing data automatically, we found that structural information is significantly more discriminatory than evolutionary information in this classification task and on the dataset used. Indeed, the three strongest connections with the class node in the network all involved structural nodes. We therefore derived a simplified Bayesian network that used just these three structural descriptors, with performance comparable to that of an all-node network. Currently, we are using Bayesian networks to integrate heterogeneous data sources, including sequence motif, protein-protein interaction (PPI) and gene expression data, to assign functions (described by the Gene Ontology) to proteins of Arabidopsis thaliana. There are over 30,000 unique gene products in Arabidopsis. However, 47% of these have an unknown molecular function, and 49% and 64% have yet to be assigned to a cellular compartment and biological process respectively. We aim to assign GO terms in all three of these functional categories. Bayesian networks are particularly suited to this problem as they can handle noisy and uncertain data and relate the functional categories to one another.
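The abstract's point that a Bayesian network handles missing data automatically can be illustrated with a minimal sketch. It assumes the simplest possible structure (a class node with independent binary child nodes) and uses hypothetical feature names and made-up probabilities; it is not the network or the data from the cited studies. A missing observation is marginalised out, which in this structure amounts to leaving that node out of the product.

```python
from typing import Dict, Optional

# Hypothetical prior over the class node (illustrative values only).
PRIOR = {"binding": 0.3, "non_binding": 0.7}

# Hypothetical conditional probabilities P(feature is True | class).
CPT = {
    "hydrophobic": {"binding": 0.7, "non_binding": 0.4},
    "conserved": {"binding": 0.6, "non_binding": 0.3},
}


def posterior(evidence: Dict[str, Optional[bool]]) -> Dict[str, float]:
    """Return P(class | evidence); a value of None marks a missing observation."""
    scores = {}
    for cls, prior in PRIOR.items():
        p = prior
        for feature, value in evidence.items():
            if value is None:
                # Summing over both values of an unobserved child gives 1,
                # so a missing feature contributes no factor.
                continue
            p_true = CPT[feature][cls]
            p *= p_true if value else 1.0 - p_true
        scores[cls] = p
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


if __name__ == "__main__":
    # Evolutionary information (here "conserved") is missing for this patch.
    print(posterior({"hydrophobic": True, "conserved": None}))
    # Both features observed.
    print(posterior({"hydrophobic": True, "conserved": True}))
```

With every feature missing, the function returns the prior unchanged, which is the behaviour expected of a network that has received no evidence.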

Similar resources

 The Quantification of Uncertainties in Production Prediction Using Integrated Statistical and Neural Network Approaches: An Iranian Gas Field Case Study

Uncertainty in production prediction has been subject to numerous investigations. Geological and reservoir engineering data comprise a huge number of data entries to the simulation models. Thus, uncertainty of these data can largely affect the reliability of the simulation model. Due to these reasons, it is worthy to present the desired quantity with a probability distribution instead of a sing...

Automatic classification of highly related Malate Dehydrogenase and L-Lactate Dehydrogenase based on 3D-pattern of active sites

Accurate protein function prediction is an important subject in bioinformatics, especially where sequentially and structurally similar proteins have different functions. Malate dehydrogenase and L-lactate dehydrogenase are two evolutionary related enzymes, which exist in a wide variety of organisms. These enzymes are sequentially and structurally similar and share common active site residues, spati...

Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks

Background: Prediction of the protein localization is among the most important issues in the bioinformatics that is used for the prediction of the proteins in the cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied for the prediction of the intracellular protein locations. These algorithms use the features extracted from pro...

Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches

DNA sequence, containing all genetic traits is not a functional entity. Instead, it transfers to protein sequences by transcription and translation processes. This protein sequence takes on a 3D structure later, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can figure is undertaken by proteins with specific functio...

A Hybrid Business Success Versus Failure Classification Prediction Model: A Case of Iranian Accelerated Start-ups

The purpose of this study is to reduce the uncertainty of early stage startups success prediction and filling the gap of previous studies in the field, by identifying and evaluating the success variables and developing a novel business success failure (S/F) data mining classification prediction model for Iranian start-ups. For this purpose, the paper is seeking to extend Bill Gross and Robert L...

Propensity based classification: Dehalogenase and non-dehalogenase enzymes

The present work was designed to classify and differentiate between the dehalogenase enzyme to non–dehalogenases (other hydrolases) by taking the amino acid propensity at the core, surface and both the parts. The data sets were made on an individual basis by selecting the 3D structures of protein available in the PDB (Protein Data Bank). The prediction of the core amino acid were predicted by I...

Publication date: 2007